Lab book: PCA outputs, whole genomes

Liam Brierley (University of Liverpool)
2020-05-01

PCAs based on genome composition for 3539 coronavirus whole-genome sequences. Points are colour-coded and ellipses drawn according to different outcome variables, though the underlying PCA for each bias type is the same. Mouseover gives outcome variable and virus name.

Dinucleotide bias // genus

As for spikes, strong separation of genera, but betacovs are scattered. Outlying cetacean viruses on the left??
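For reference, a minimal sketch of how a dinucleotide bias profile could be computed for one genome. This assumes the common observed/expected odds ratio (dinucleotide frequency divided by the product of its two nucleotide frequencies); the exact definition used for these PCAs is not stated here, and the function name `dinucleotide_bias` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def dinucleotide_bias(seq):
    """Observed/expected ratio for each of the 16 dinucleotides (assumed definition)."""
    seq = seq.upper()
    # Count single nucleotides, ignoring ambiguous symbols such as N
    mono = Counter(b for b in seq if b in BASES)
    # Count overlapping dinucleotides made only of unambiguous bases
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        # Undefined when a nucleotide is absent from the sequence
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias

profile = dinucleotide_bias("ACGTACGT")  # toy sequence, not real data
```

Each genome would then contribute one row of 16 values to the matrix passed to the PCA.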

Dinucleotide bias // human infection capability

As for spikes, tight clusters of individual viruses.

Codon bias (RSCU) // genus

Stronger separation of genera than for spikes, likely because stop codons are not so strongly weighted. Alphacovs still seem to prefer TGA, but the preference is less clear for other genera.
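As a reminder of what RSCU measures, a minimal sketch under the standard definition: each codon's count divided by the mean count across its synonymous family. It assumes an in-frame coding sequence and the standard genetic code, with stop codons treated as one family that can be switched off. The names `rscu` and `include_stops` are hypothetical, and how coding regions were extracted from the whole genomes is not covered.

```python
from collections import Counter

# Standard genetic code, codons ordered by T, C, A, G at each position
_B = "TCAG"
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in _B for b in _B for c in _B), _AA))

def rscu(cds, include_stops=True):
    """RSCU per codon for an in-frame coding sequence (assumed input)."""
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    # Group codons into synonymous families by encoded amino acid ('*' = stop)
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    out = {}
    for aa, codons in families.items():
        if aa == "*" and not include_stops:
            continue
        if len(codons) == 1:
            continue  # Met and Trp have no synonyms, so RSCU is uninformative
        total = sum(counts[c] for c in codons)
        for c in codons:
            # Count relative to the family mean; undefined if the family is unused
            out[c] = counts[c] * len(codons) / total if total else float("nan")
    return out

stops = rscu("TGATGATAA")  # toy input: TGA twice, TAA once
```

With `include_stops=False` the three stop codons are dropped from the profile, which corresponds to the "without stop codon" variant of the PCA.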

Codon bias (RSCU) // human infection capability

Epidemic coronaviruses are fairly distant from endemic coronaviruses here, unlike for spikes. The stop codon preference shared across human coronaviruses is also weaker than for spikes. Not worth looking at without stop codons, as they’re not overly influential here.

Codon bias (RSCU) without stop codons // human infection capability // PC3 and PC4

However, when stop codons are excluded, a pattern similar to spikes emerges in PC3/4: good separation of human vs nonhuman viruses, even though these components explain only a small amount of variance. Real signal or just noise…??
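To put a number on how little variance PC3/4 carry, a minimal sketch that computes the proportion of variance per component from a matrix of PC scores. It assumes all components are retained (otherwise the proportions are overstated); the function name `explained_variance_ratio` and the toy scores are hypothetical, not values from this analysis.

```python
from statistics import variance

def explained_variance_ratio(scores):
    """Proportion of total variance per PC.

    scores: list of rows, one per genome; columns are PC1, PC2, ... scores.
    Assumes every component is included, so the column variances sum to
    the total variance of the data.
    """
    columns = list(zip(*scores))
    variances = [variance(col) for col in columns]
    total = sum(variances)
    return [v / total for v in variances]

# Toy two-component example, not real data
toy_scores = [[2.0, 1.0], [-2.0, -1.0], [2.0, -1.0], [-2.0, 1.0]]
ratios = explained_variance_ratio(toy_scores)
```

Summing the entries for the third and fourth components gives the share of variance behind the PC3/4 plot.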

Amino acid bias // genus

Unlike for spikes, very strong separation of genera when the PCA is based on the whole genome.

Amino acid bias // human infection capability

As for spikes, reasonable separation between epidemic coronaviruses and endemic coronaviruses!